Credit Card Fraud Detection

Introduction :

Ever since payment systems were invented, people have looked for new ways to illegally access someone else's finances. This has become a major problem in the modern era, because purchases can be made online with nothing more than credit card information. Even before two-step verification was introduced for online purchases in the United States in the 2010s, many users of American retail websites were victims of online transaction fraud. When a data breach leads to monetary theft, and with it the loss of customer loyalty and damage to the company's reputation, it puts organisations, consumers, banks, and merchants at risk.

In 2017, unauthorised card operations affected 16.7 million victims. Furthermore, according to the Federal Trade Commission (FTC), credit card fraud claims increased by 40% in 2017 compared to the previous year. Around 13,000 incidents were reported in California and 8,000 in Florida, the two states with the highest per capita rates of this type of crime. The amount of money at stake was projected to exceed approximately $30 billion by 2020.

Here are some credit card fraud statistics:

image source : https://spd.group/machine-learning/credit-card-fraud-detection/

What is Credit Card Fraud Detection?

“A series of operations conducted to prevent money or property from being gained under false pretences is known as fraud detection.”

Fraud can be committed in a variety of ways and in a wide range of industries. To make a decision, most detection systems combine several fraud detection datasets to build a connected picture of both valid and invalid payment data. The decision must take into account IP address, geolocation, device identification, “BIN” data, global latitude/longitude, historic transaction patterns, and the actual transaction information. In practice, this means that merchants and issuers deploy analytics-based responses that use internal and external data to apply a set of business rules or analytical algorithms to detect fraud.

Credit card fraud detection with machine learning is a process in which a data science team investigates the data and develops a model that gives the best results in revealing and preventing fraudulent transactions. This is achieved by bringing together all meaningful features of card users’ transactions, such as date, user zone, product category, amount, provider, and the client’s behavioral patterns. The information is then run through a trained model that finds patterns and rules so that it can classify a transaction as fraudulent or legitimate.

Clone transactions :

Cloning a transaction means replicating an original transaction, or performing transactions that closely resemble it. This can happen, for example, when an organization tries to collect payment from a partner multiple times by sending the same invoice to different departments.

Conventional rule-based fraud detection algorithms do not work well at distinguishing a fraudulent transaction from an erroneous one. For example, a user might accidentally click the submit button twice or order the same product twice. Ideally, the system can tell a fraudulent transaction apart from one made in error. Here, machine learning techniques are more effective at separating cloned transactions caused by human error from real fraud.

Account theft and suspicious transactions :

When an individual’s personal information, such as a Social Security number, a secret question answer, or a date of birth, is stolen by criminals, they can use it to carry out financial operations. Many fraudulent transactions are related to identity theft, so financial fraud prevention systems should pay the most attention to building an analysis of each user’s behavior.

Suppose there is a certain regularity in the way a customer makes payments, e.g. a person visits a certain bar once a week at the same time and always spends about $40 to $60. If the same account is used to make a payment at a bar located in another part of the city and for a sum of more than $60, this behavior would be considered abnormal. The next move would be to send a verification request to the card owner in order to confirm that he or she made the transaction.

Metrics such as standard deviation, averages, and high/low values are the most useful for spotting abnormal behavior. Individual payments are compared with personal benchmarks to identify transactions that deviate strongly from them. If such a deviation occurs, the best option is to verify the transaction with the account holder.
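
The comparison against a personal benchmark can be sketched as follows. This is a minimal illustration rather than a production rule: the function name `is_unusual`, the sample spending history, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    away from the mean of the account's past payments (hypothetical rule)."""
    if len(history) < 2:
        return False  # not enough data to build a personal benchmark
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical weekly bar payments in the $40 to $60 range.
bar_history = [42.0, 55.0, 48.0, 60.0, 51.0, 45.0]

typical = is_unusual(bar_history, 52.0)    # within the usual range
outlier = is_unusual(bar_history, 180.0)   # far above the usual range
print(typical, outlier)
```

A flagged payment would not be blocked outright in this scheme; it would trigger the verification request described above. In practice the threshold would be tuned per customer segment.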

False application fraud :

Application fraud is often accompanied by account or identity theft. It means that someone applies to open a new credit account or credit card in another person's name. First, the criminals steal documents that will serve as supporting evidence for the fake application.

Anomaly detection helps determine whether a transaction shows abnormal patterns, for example in its date and time or in the quantity of items. If the algorithm detects such unusual behavior, the bank account holder is protected by several verification methods.
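
As one possible illustration of anomaly detection on time of day and item quantity, the sketch below uses scikit-learn's `IsolationForest`. The article does not say which algorithm is used, so the choice of Isolation Forest, the two features, and the synthetic purchase history are all assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical purchase history: hour of day and number of items per order.
hours = rng.normal(loc=18.0, scale=1.0, size=200)
items = rng.integers(1, 4, size=200)  # 1 to 3 items
history = np.column_stack([hours, items])

# One hypothetical suspicious order: 3 a.m., 40 items.
suspicious = np.array([[3.0, 40.0]])
transactions = np.vstack([history, suspicious])

# Isolation Forest scores points by how easily random splits isolate them.
model = IsolationForest(n_estimators=100, random_state=0)
labels = model.fit_predict(transactions)  # 1 = normal, -1 = anomaly

suspicious_label = labels[-1]
n_flagged = int((labels == -1).sum())
print(suspicious_label, n_flagged)
```

With the default settings some ordinary orders may also be flagged, which is why a flag usually leads to extra verification rather than an automatic decline.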

Credit Card Skimming (electronic or manual)

Credit card skimming means illegally copying a credit or bank card using a device that reads and duplicates the information from the original card. Fraudsters use machines called "skimmers" to extract card numbers and other credit card information, store it, and resell it to criminals.

As with identity theft, suspicious transactions made with an electronic or manual copy of a card will show up in the transaction information. Classification techniques can be used to determine whether a transaction is fraudulent based on the equipment used, geographic location, and information about customer behavior patterns.

Account takeover :

Fraudsters can send phishing emails to cardholders. The messages look perfectly legitimate (for example, with a very similar bank URL and a trustworthy logo), as if they were sent by a bank, and ask for details such as an online banking number and password. If you click the wrong link or provide valuable information in response to a message from a fake banking website, attackers will empty your bank account into one of their own within a few days.

To counter this scheme, artificial intelligence solutions rely on neural networks or pattern recognition. Neural networks can learn suspicious patterns, as well as detect classes and clusters, and use these patterns to detect fraud.

How Does Credit Card Fraud Happen?

Credit card fraud is usually caused either by the card owner’s negligence with his or her data or by a breach in a website’s security.
Here are a few examples:

  1. If your card is lost or stolen, unauthorized charges may occur; that is, the person who finds it uses it to make purchases.
  2. Criminals can also forge your name and use the card, or order items through a mobile phone or computer.
  3. There is also the problem of counterfeit credit cards: fake cards carrying real account information that has been stolen from the cardholder. This is especially dangerous because the victims still have their real cards and do not know that someone has copied them. These fraudulent cards look legitimate and carry the logo and encoded magnetic stripe of the original. Criminals often destroy them after several successful payments, shortly before the victim notices the problem and reports it.

Business Challenge :

Detecting fraudulent transactions is of great importance to any credit card company. We have been tasked by a well-known company with detecting potential fraud so that customers are not charged for items they did not purchase.

The goal, therefore, is to build a classifier that tells whether a transaction is fraudulent or not.

The main challenges involved in credit card fraud detection are:
  1. An enormous amount of data is processed every day, and the model that is built must be fast enough to respond to the scam in time.
  2. Imbalanced data: most transactions (99.8%) are not fraudulent, which makes it really hard to detect the fraudulent ones.
  3. Data availability, as the data is mostly private.
  4. Misclassified data can be another major issue, as not every fraudulent transaction is caught and reported.
  5. Adaptive techniques used against the model by the scammers.

How to tackle these challenges?
  1. The model must be simple and fast enough to detect an anomaly and classify it as a fraudulent transaction as quickly as possible.
  2. Imbalance can be dealt with by using methods designed for it, such as resampling the classes or weighting them during training.
  3. To protect the privacy of the user, the dimensionality of the data can be reduced.
  4. A more trustworthy source that double-checks the data must be used, at least for training the model.
  5. We can keep the model simple and interpretable so that, when the scammers adapt to it, we can get a new model up and running with just a few tweaks.
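
One simple way to handle the imbalance mentioned above is random undersampling of the legitimate class. The sketch below is only an illustration under stated assumptions: the helper name `undersample`, the 1:1 target ratio, and the toy rows are hypothetical, and the label column name `Class` (1 = fraud) is taken from the Kaggle dataset rather than from this article.

```python
import random


def undersample(rows, label_key="Class", ratio=1.0, seed=0):
    """Keep every fraud row and a random subset of legitimate rows.

    `ratio` is the number of legitimate rows kept per fraud row.
    """
    rng = random.Random(seed)
    fraud = [r for r in rows if r[label_key] == 1]
    legit = [r for r in rows if r[label_key] == 0]
    n_keep = min(len(legit), int(len(fraud) * ratio))
    sample = fraud + rng.sample(legit, n_keep)
    rng.shuffle(sample)  # avoid having all fraud rows first
    return sample


# Toy data with the same 99.8% / 0.2% split: 998 legitimate rows, 2 fraud rows.
rows = [{"Amount": float(i), "Class": 1 if i < 2 else 0} for i in range(1000)]
balanced = undersample(rows)
print(len(balanced), sum(r["Class"] for r in balanced))
```

Resampling of this kind should be applied to the training data only, so that the test set keeps the real class proportions.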

Description of Dataset :

Dataset Source :

https://www.kaggle.com/mlg-ulb/creditcardfraud

Importing Libraries

Importing the Dataset

Summary of Data

Total Unique value

Total Missing values

Exploratory Data Analysis

Outliers treatment

Train/Test Split
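
A stratified split keeps the share of fraud the same in the training and test sets, which matters when fraud is this rare. The sketch below is a minimal example on synthetic data. The 227,845 training samples mentioned later in this article are consistent with an 80/20 split of the 284,807 rows in the Kaggle dataset, but the exact split parameters are an assumption, as are the `stratified_split` helper name and the random seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def stratified_split(X, y, test_size=0.2, seed=42):
    """Split X and y while preserving the class proportions of y."""
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )


# Synthetic stand-in for the real data: 10,000 rows, 0.2% labelled as fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = np.zeros(10_000, dtype=int)
y[:20] = 1

X_train, X_test, y_train, y_test = stratified_split(X, y)
print(X_train.shape, X_test.shape, y_train.sum(), y_test.sum())
```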

Exploratory Data Analysis

1. Time

2. Amount

3. Time vs. Amount

4. V1 - V28

    The histogram doesn't show us outliers. 
    Let's try a boxplot:

5. Mutual Information between Fraud and the Predictors

Mutual information is a non-parametric method to estimate the mutual dependence between two variables. Mutual information of 0 indicates no dependence, and higher values indicate higher dependence.

According to the sklearn User Guide, "mutual information methods can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation."

We have 227,845 training samples, so mutual information should work well. Because the target variable is discrete, we use mutual_info_classif (as opposed to mutual_info_regression for a continuous target).
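
The following sketch shows how `mutual_info_classif` can be called. It runs on synthetic data, not on the fraud dataset: the two columns (one informative, one pure noise), the sample size, and the random seed are assumptions chosen only to make the behaviour visible.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Binary target, standing in for the fraud label.
y = rng.integers(0, 2, size=n)

# One predictor that depends on the target and one that is pure noise.
informative = y + rng.normal(scale=0.3, size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])

# Estimated mutual information (in nats) between each column and the target.
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(["informative", "noise"], mi)))
```

The informative column should receive a clearly higher score than the noise column, whose estimate stays close to 0. On the real data, the scores can be sorted to rank the predictors by their dependence on the fraud label.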

6. Modeling

Logistic Regression and Support Vector Classifier

Random Forest

Test Set Evaluation of the Best Model
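
As a sketch of why plain accuracy is a poor yardstick when about 99.8% of transactions are legitimate, the example below scores a trivial model that never predicts fraud. The helper name `evaluate` and the toy labels are assumptions for illustration and are not the evaluation used for the best model.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# 2 fraud cases among 1,000 transactions, and a model that flags nothing.
y_true = [1] * 2 + [0] * 998
never_fraud = [0] * 1000

scores = evaluate(y_true, never_fraud)
print(scores)  # accuracy is 0.998 although recall is 0.0
```

Precision and recall on the fraud class therefore say more about a fraud detector than accuracy does.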

Conclusion